In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

In depth with SVMs: Support Vector Machines

SVM stands for "support vector machine". SVMs are efficient and easy-to-use estimators. They come in two kinds: SVCs (Support Vector Classifiers) for classification problems, and SVRs (Support Vector Regressors) for regression problems.

Linear SVMs

The SVM module contains LinearSVC, which we already discussed briefly in the section on linear models. Using SVC(kernel="linear") will also yield a linear predictor that differs only in minor technical aspects.
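
As a quick sketch of this, both estimators follow the usual scikit-learn fit/predict API; the synthetic make_blobs data below is only an assumed toy example:

In [ ]:
# Sketch: LinearSVC and SVC(kernel="linear") both fit a linear classifier.
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC, SVC

X_toy, y_toy = make_blobs(n_samples=100, centers=2, random_state=0)
linear_svc = LinearSVC().fit(X_toy, y_toy)
svc_linear = SVC(kernel="linear").fit(X_toy, y_toy)
print(linear_svc.score(X_toy, y_toy), svc_linear.score(X_toy, y_toy))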

Kernel SVMs

The real power of SVMs lies in using kernels, which allow for non-linear decision boundaries. A kernel defines a similarity measure between data points. The most common are:

  • linear will give linear decision frontiers. It is the most computationally efficient approach and the one that requires the least amount of data.

  • poly will give decision frontiers that are polynomial. The degree of this polynomial is given by the 'degree' argument.

  • rbf uses 'radial basis functions' centered at each support vector to assemble a decision frontier. The size of the RBFs ultimately controls the smoothness of the decision frontier. RBFs are the most flexible approach, but also the one that will require the largest amount of data. (A short sketch comparing the three kernels follows this list.)
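
A minimal sketch of how these kernels are selected in scikit-learn's SVC (the make_moons toy data is just an assumption for illustration; degree only matters for the 'poly' kernel):

In [ ]:
# Sketch: the kernel is chosen via the `kernel` argument of SVC,
# and the polynomial degree via `degree` (used only by the 'poly' kernel).
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X_toy, y_toy = make_moons(n_samples=200, noise=0.2, random_state=0)
for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel, degree=3).fit(X_toy, y_toy)
    print(kernel, clf.score(X_toy, y_toy))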

Predictions in a kernel SVM are made using the formula

$$ \hat{y} = \text{sign}\left(\alpha_0 + \sum_{j}\alpha_j y^{(j)} k(\mathbf{x}^{(j)}, \mathbf{x})\right) $$

where $\mathbf{x}^{(j)}$ are the training samples, $y^{(j)}$ the corresponding labels, $\mathbf{x}$ is the test sample to predict on, $k$ is the kernel, and $\alpha$ are learned parameters.

What this says is "if $\mathbf{x}$ is similar to $\mathbf{x}^{(j)}$, then they probably have the same label", where the importance of each $\mathbf{x}^{(j)}$ for this decision is learned. [Or something much less intuitive about an infinite-dimensional Hilbert space.]

Often only a few samples have a non-zero $\alpha_j$; these are called the "support vectors", from which SVMs get their name. They are the most discriminant samples.
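
To make the formula concrete, here is a sketch that rebuilds the decision function of a fitted SVC by hand from its learned attributes: support_vectors_ holds the $\mathbf{x}^{(j)}$ with non-zero $\alpha_j$, dual_coef_ the products $\alpha_j y^{(j)}$, and intercept_ the offset $\alpha_0$ (the toy data and the gamma value are assumptions for illustration):

In [ ]:
# Sketch: reconstruct SVC's decision function from its support vectors.
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X_toy, y_toy = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="rbf", gamma=1.0).fit(X_toy, y_toy)

# k(x^(j), x) for every sample against every support vector
K = rbf_kernel(X_toy, clf.support_vectors_, gamma=1.0)
# dual_coef_ stores alpha_j * y^(j); intercept_ is alpha_0
manual = K @ clf.dual_coef_.ravel() + clf.intercept_
print(np.allclose(manual, clf.decision_function(X_toy)))  # should print True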

The most important parameter of the SVM is the regularization parameter $C$, which bounds the influence of each individual sample:

  • Low C values: many support vectors; the decision frontier is essentially driven by the difference of the class means, mean(class A) - mean(class B)
  • High C values: a small number of support vectors; the decision frontier is fully driven by the most discriminant samples (see the sketch after this list)
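
A minimal sketch of this effect, counting support vectors as C grows (synthetic overlapping blobs assumed for illustration):

In [ ]:
# Sketch: lower C -> more support vectors, higher C -> fewer support vectors.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X_toy, y_toy = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)
for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C).fit(X_toy, y_toy)
    print("C=%g: %d support vectors" % (C, len(clf.support_vectors_)))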

The other important parameters are those of the kernel. Let's look at the RBF kernel in more detail:

$$k(\mathbf{x}, \mathbf{x'}) = \exp(-\gamma ||\mathbf{x} - \mathbf{x'}||^2)$$

In [ ]:
from sklearn.metrics.pairwise import rbf_kernel

# RBF kernel between points on a line and a single reference point at 0
line = np.linspace(-3, 3, 100)[:, np.newaxis]
kernel_value = rbf_kernel(line, [[0]], gamma=1)
plt.plot(line, kernel_value)

The rbf kernel has an inverse bandwidth parameter, gamma, where large values of gamma mean a very localized influence for each data point, and small values mean a very global influence. Let's see these two parameters (C and gamma) in action:


In [ ]:
from figures import plot_svm_interactive
plot_svm_interactive()
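
As a static complement to the interactive plot, we can redraw the kernel curve from the earlier cell for a few values of gamma (the particular values are arbitrary):

In [ ]:
# Sketch: larger gamma -> narrower kernel -> more localized influence of each sample.
from sklearn.metrics.pairwise import rbf_kernel

line = np.linspace(-3, 3, 100)[:, np.newaxis]
for gamma in [0.1, 1, 10]:
    plt.plot(line, rbf_kernel(line, [[0]], gamma=gamma), label="gamma=%g" % gamma)
plt.legend()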

Exercise: tune an SVM on the digits dataset


In [ ]:
from sklearn import datasets
digits = datasets.load_digits()
X, y = digits.data, digits.target
# split the dataset, apply grid-search
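
One possible way to tackle this exercise (a sketch, not the only solution; it assumes a recent scikit-learn where train_test_split and GridSearchCV live in sklearn.model_selection):

In [ ]:
# Possible solution sketch: hold out a test set, then grid-search C and gamma.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.0001, 0.001, 0.01, 0.1]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))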